Reading the file with all the customer churn details into a data frame, then displaying the structure of the data frame.
churn_data<-data.frame(read.csv("Bank_Customer_Churn_Prediction.csv",header=TRUE)) #reading the file into data frame
str(churn_data) #printing the data frame structure
## 'data.frame': 10000 obs. of 12 variables:
## $ customer_id : int 15634602 15647311 15619304 15701354 15737888 15574012 15592531 15656148 15792365 15592389 ...
## $ credit_score : int 619 608 502 699 850 645 822 376 501 684 ...
## $ country : chr "France" "Spain" "France" "France" ...
## $ gender : chr "Female" "Female" "Female" "Female" ...
## $ age : int 42 41 42 39 43 44 50 29 44 27 ...
## $ tenure : int 2 1 8 1 2 8 7 4 4 2 ...
## $ balance : num 0 83808 159661 0 125511 ...
## $ products_number : int 1 1 3 2 1 2 2 4 2 1 ...
## $ credit_card : int 1 0 1 0 1 1 1 1 0 1 ...
## $ active_member : int 1 1 0 0 1 0 1 0 1 1 ...
## $ estimated_salary: num 101349 112543 113932 93827 79084 ...
## $ churn : int 1 0 1 0 0 1 0 1 0 0 ...
The churn data frame has 10000 customers, each described by the following 12 attributes:
Customer ID : the unique ID of each individual customer
Credit Score : a number depicting the customer's creditworthiness
Country : the country the customer banks from
Gender : the gender the customer identifies with
Age : the customer's age
Tenure : the number of years the customer has been with the bank
Balance : the amount currently available in the customer's account
Products Number : the number of products purchased by the customer through the bank
Credit Card : indicates whether the customer has a credit card
Active Member : indicates whether the customer is an active member
Estimated Salary : the bank's estimate of the customer's income
Churn : indicates whether the customer has left the bank
Checking for duplicates in the data frame.
nrow(unique(churn_data)) #checking for unique rows
All 10000 rows in the data are unique, so the data frame has no duplicate records and we can proceed to the next step of the analysis.
Checking if there are any NA values in the data frame
sum(is.na(churn_data)) # checking for NA values in the data frame
We found that our dataset was clean without any NA values.
Checking for summary of the churn data frame.
summary(churn_data) #looking for summary of the data frame
## customer_id credit_score country gender
## Min. :15565701 Min. :350 Length:10000 Length:10000
## 1st Qu.:15628528 1st Qu.:584 Class :character Class :character
## Median :15690738 Median :652 Mode :character Mode :character
## Mean :15690941 Mean :651
## 3rd Qu.:15753234 3rd Qu.:718
## Max. :15815690 Max. :850
## age tenure balance products_number
## Min. :18.0 Min. : 0.00 Min. : 0 Min. :1.00
## 1st Qu.:32.0 1st Qu.: 3.00 1st Qu.: 0 1st Qu.:1.00
## Median :37.0 Median : 5.00 Median : 97199 Median :1.00
## Mean :38.9 Mean : 5.01 Mean : 76486 Mean :1.53
## 3rd Qu.:44.0 3rd Qu.: 7.00 3rd Qu.:127644 3rd Qu.:2.00
## Max. :92.0 Max. :10.00 Max. :250898 Max. :4.00
## credit_card active_member estimated_salary churn
## Min. :0.000 Min. :0.000 Min. : 12 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.: 51002 1st Qu.:0.000
## Median :1.000 Median :1.000 Median :100194 Median :0.000
## Mean :0.706 Mean :0.515 Mean :100090 Mean :0.204
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:149388 3rd Qu.:0.000
## Max. :1.000 Max. :1.000 Max. :199992 Max. :1.000
Credit score: the minimum credit score is 350, the average is 651, and the maximum is 850.
Age: the minimum age is 18, the average is 38.9, and the maximum is 92.
Tenure: customers have been with the bank for an average of 5 years and for up to 10 years.
Balance: the maximum account balance is 250898, with an average of 76486.
Country, gender, active_member, credit_card, products_number, and churn are categorical variables.
Customer ID is an identifier only and has no bearing on the churn rate.
churn_data <- churn_data[-c(1)] # removing column 1 (customer_id) from the data set
ncol(churn_data) # counting number of variables in the data frame
str(churn_data) # checking the structure of the data frame after removing the customer_id field
Since we are not analyzing individual customers, the customer_id field is irrelevant, so we removed it from the data frame. After removing customer_id we have 11 columns.
Converting required variables into categorical values
str(churn_data)
churn_data_category=churn_data
churn_data_category$credit_card=factor(churn_data$credit_card)
churn_data_category$active_member=factor(churn_data$active_member)
churn_data_category$churn=factor(churn_data$churn)
churn_data_category$gender=factor(churn_data$gender)
churn_data_category$tenure =factor(churn_data$tenure )
churn_data_category$products_number=factor(churn_data$products_number)
churn_data_category$age=factor(churn_data$age)
str(churn_data_category)
We converted the following variables into categorical (factor) variables: credit_card, active_member, churn, gender, products_number, tenure, and age. Notice that the data frame churn_data still has all variables numeric, while the data frame churn_data_category contains the categorical columns we just converted.
Converting Boolean values into Character format
churn_data$active_member[churn_data$active_member== 1]<-"Active"
churn_data$active_member[churn_data$active_member== 0]<-"In Active"
churn_data$credit_card[churn_data$credit_card== 1]<-"Credit Card"
churn_data$credit_card[churn_data$credit_card== 0]<-"No-Credit Card"
churn_data$churn[churn_data$churn== 1]<-"Churned"
churn_data$churn[churn_data$churn== 0]<-"Retained"
head(churn_data)
We assigned descriptive string values to the corresponding Boolean values for these variables, so that each variable is easier to interpret.
Checking for outliers in the credit_score, age, tenure, balance, and estimated_salary variables using the outlierKD2 function.
credit_score_clean <- outlierKD2(churn_data, credit_score, rm = TRUE, boxplt = TRUE, qqplt = TRUE)
xkablesummary(credit_score_clean)
The proportion of outliers in the credit_score variable is 0.2%, and they were removed successfully using the outlierKD2 function.
age_clean <- outlierKD2(churn_data, age, rm = TRUE, boxplt = TRUE, qqplt = TRUE)
xkablesummary(age_clean)
tenure_clean <- outlierKD2(churn_data, tenure, rm = TRUE, boxplt = TRUE, qqplt = TRUE)
xkablesummary(tenure_clean)
balance_clean <- outlierKD2(churn_data, balance, rm = TRUE, boxplt = TRUE, qqplt = TRUE)
xkablesummary(balance_clean)
salary_clean <- outlierKD2(churn_data, estimated_salary, rm = TRUE, boxplt = TRUE, qqplt = TRUE)
xkablesummary(salary_clean)
Using the outlierKD2 function we found outliers only in the age and credit_score variables (3.7% and 0.2%, respectively), so we removed them.
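The outlierKD2 function follows the usual boxplot (Tukey fence) rule. Below is a minimal, self-contained sketch of that rule; the helper name and the toy data are illustrative, not part of this report's pipeline:

```r
# Tukey-fence rule, the same rule boxplot.stats() uses:
# a point is an outlier if it falls below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
flag_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1]) # 1.5 times the interquartile range
  x < q[1] - fence | x > q[2] + fence
}

set.seed(42)
toy <- c(rnorm(100, mean = 650, sd = 50), 120, 1500) # two planted extremes
which(flag_outliers(toy)) # the planted extremes (positions 101 and 102) are flagged
```

Setting rm = TRUE in outlierKD2 replaces the flagged values with NA rather than dropping the rows.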
Analyzing each variable in the data frame using relevant plots, along with the calculated mean, SD, and percentages.
library(ggplot2)
ggplot(data=churn_data, aes(x=credit_score)) +
geom_histogram(col="black",bins=30,
fill="dark orange",
alpha =.7) + # opacity
  labs(title="Customer's Credit Scores",x="Customer Credit Score", y="Customer Count")
print(mean(churn_data$credit_score))
print(sd(churn_data$credit_score))
The average credit score is 650.529 with a standard deviation of 96.653; most customers' credit scores fall between 600 and 700.
ggplot(data=churn_data,aes(x=country,fill=country))+
geom_bar(col="black")+
scale_fill_brewer(palette="Reds") +
labs(title = "Bank Customer vs Country",x="Country",y="Customer Count")+
theme_minimal()
The customers are grouped by the country in which they hold their accounts. As we see from the plot, France accounts for about half of the customers, the highest share among the three countries, while Germany and Spain hold roughly equal shares.
ggplot(churn_data, aes(x = age,fill=cut(age,100))) +
geom_histogram(show.legend = FALSE,col="black",bins=30)+
scale_fill_discrete(h=c(240,10),c=120,l=70)+
theme_minimal()+
labs(x=" Customer Age",y=" Customer Count")+
ggtitle("Customer ages")
print(mean(churn_data$age))
print(sd(churn_data$age))
The majority of bank customers are below the age of 50, with an average age of 38.922 and a standard deviation of 10.488.
library(ggplot2)
ggplot(data=churn_data, aes(x=tenure)) +
geom_histogram(col="black",bins=11,
fill="Yellow",
alpha =.7) +
labs(title="Years spent as customer",x="Number of years", y="Customer Count")
Most customers have been with the bank for more than a year.
library(dplyr) # provides %>%, group_by(), and count()
df <- churn_data %>%
group_by(gender) %>% # Variable to be transformed
count() %>%
ungroup() %>%
mutate(perc = `n` / sum(`n`)) %>%
arrange(perc) %>%
mutate(labels = scales::percent(perc))
str(df)
ggplot(df, aes(x = "", y = perc, fill =gender )) +
geom_col(color="black") +
geom_text(aes(label = labels),color = c("black", "black"),
position = position_stack(vjust = 0.5)) +
coord_polar(theta = "y")+
labs(title="Percentage of Male and Female customers in bank",x="", y="")
Male customers make up the larger share at 55%; the remaining 45% are female customers.
df <- churn_data %>%
group_by(active_member) %>% # Variable to be transformed
count() %>%
ungroup() %>%
mutate(perc = `n` / sum(`n`)) %>%
arrange(perc) %>%
mutate(labels = scales::percent(perc))
str(df)
ggplot(df, aes(x = "", y = perc, fill = active_member)) +
geom_col(color="black") +
geom_label(aes(label = labels),color = c("black", "white"),
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
coord_polar(theta = "y")+
scale_fill_grey()+
labs(title="Percentage of active members in bank",x="", y="")
A large share of customers are inactive: 48.5% of the bank's customers are not active members.
df <- churn_data %>%
group_by(credit_card) %>% # Variable to be transformed
count() %>%
ungroup() %>%
mutate(perc = `n` / sum(`n`)) %>%
arrange(perc) %>%
mutate(labels = scales::percent(perc))
str(df)
ggplot(df, aes(x = "", y = perc, fill = credit_card)) +
geom_col(color="black") +
geom_label(aes(label = labels),color = c("black", "black"),
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
coord_polar(theta = "y")+
scale_fill_brewer(palette="Greens")+
labs(title="Percentage of customer use credit card",x="", y="")
71% of the bank's customers use a credit card; the remaining 29% do not.
df <- churn_data %>%
group_by(products_number) %>% # Variable to be transformed
count() %>%
ungroup() %>%
mutate(perc = `n` / sum(`n`)) %>%
arrange(perc) %>%
mutate(labels = scales::percent(perc))
str(df)
ggplot(df, aes(x = "", y = perc, fill = products_number)) +
geom_col(color="black") +
geom_label(aes(label = labels),color = c("white", "white","white","white"),
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
coord_polar(theta = "y")+
labs(title="Percentage of different products used by customers ",x="", y="")
Most customers (50%) hold a single product, while holding four products is the rarest, at 0.6%.
df <- churn_data %>%
group_by(churn) %>%
count() %>%
ungroup() %>%
mutate(perc = `n` / sum(`n`)) %>%
arrange(perc) %>%
mutate(labels = scales::percent(perc))
str(df)
ggplot(df, aes(x = "", y = perc, fill = churn)) +
geom_col(color="black") +
geom_label(aes(label = labels),color = c("black", "black"),
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
coord_polar(theta = "y")+
scale_fill_brewer(palette="Purples")+
labs(title="Percentage of customer retained",x="", y="")
The bank retained 80% of its customers; the remaining 20% churned.
Inferential statistics for customer credit scores
mean.credit_score = mean(churn_data$credit_score, na.rm = TRUE); #mean of the customer credit scores
str(mean.credit_score)
## num 651
sd_credit_score = sd(churn_data$credit_score)
str(sd_credit_score)
## num 96.7
For the entire population of credit_score in the churn_data data frame, mean = 650.53 and sd = 96.653.
Now let us create a smaller sample of the credit score out of the entire population, and call it churn_credit_sample.
set.seed(321)
churn_credit_sample = churn_data[ sample(nrow(churn_data),4000), ]
format(mean(churn_credit_sample$credit_score), digits=5)
## [1] "650.19"
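The intervals z.test reports below follow the standard z-interval formula (a textbook result, shown here for reference, using the sample mean and sigma from above):

\[
\bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} = 650.19 \pm z_{\alpha/2}\,\frac{96.7}{\sqrt{4000}} \approx 650.19 \pm 1.96 \times 1.53 \approx (647.2,\ 653.2)
\]

at the 95% confidence level; the 99% and 50% intervals replace \(z_{\alpha/2} = 1.96\) with 2.576 and 0.674, respectively.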
loadPkg("BSDA")
ztestcredit95 = z.test(churn_credit_sample$credit_score, sigma.x = 96.7) # default conf.level = 0.95
ztestcredit95
ztestcredit99 = z.test(churn_credit_sample$credit_score, sigma.x = 96.7, conf.level=0.99 )
ztestcredit99
ztestcredit50 = z.test(churn_credit_sample$credit_score, sigma.x = 96.7, conf.level=0.50 )
ztestcredit50
ztestcredit95$conf.int
ztestcredit99$conf.int
ztestcredit50$conf.int
loadPkg("BSDA")
ttestcredit95 = t.test(churn_credit_sample$credit_score) # default conf.level = 0.95
ttestcredit95
ttestcredit99 = t.test(churn_credit_sample$credit_score, conf.level=0.99 )
ttestcredit99
ttestcredit50 = t.test(churn_credit_sample$credit_score, conf.level=0.50 )
ttestcredit50
ttestcredit95$conf.int
ttestcredit99$conf.int
ttestcredit50$conf.int
Inferential statistics for customer account balance.
mean.balance= mean(churn_data$balance, na.rm = TRUE); #mean of the customer account balance
str(mean.balance)
## num 76486
sd_balance = sd(churn_data$balance)
str(sd_balance)
## num 62397
For the entire population of balance in the churn_data data frame, mean = 76486 and sd = 62397.
Now let us create a smaller sample of the account balance out of the entire population, and call it churn_balance_sample.
set.seed(321)
churn_balance_sample = churn_data[ sample(nrow(churn_data),4000), ]
format(mean(churn_balance_sample$balance), digits=5)
## [1] "75942"
loadPkg("BSDA")
ztestbalance95 = z.test(churn_balance_sample$balance, sigma.x = 62397.4) # default conf.level = 0.95
ztestbalance95
ztestbalance99 = z.test(churn_balance_sample$balance, sigma.x = 62397.4, conf.level=0.99 )
ztestbalance99
ztestbalance50 = z.test(churn_balance_sample$balance, sigma.x = 62397.4, conf.level=0.50 )
ztestbalance50
ztestbalance95$conf.int
ztestbalance99$conf.int
ztestbalance50$conf.int
loadPkg("BSDA")
ttestbalance95 = t.test(churn_balance_sample$balance) # default conf.level = 0.95
ttestbalance95
ttestbalance99 = t.test(churn_balance_sample$balance, conf.level=0.99 )
ttestbalance99
ttestbalance50 = t.test(churn_balance_sample$balance, conf.level=0.50 )
ttestbalance50
ttestbalance95$conf.int
ttestbalance99$conf.int
ttestbalance50$conf.int
We run a chi-square test to determine whether the churn rate depends on gender.
Chi Square Test for independence
H0: Customer’s gender and churn are independent
H1: Customer’s gender and churn are not independent
#Contingency table
contab = table(churn_data$churn, churn_data$gender)
xkabledply(contab, title="Contingency table for Gender(Male, Female) vs Churn (0 or 1)")
#Chisquare Test of independency
chitests = chisq.test(contab)
chitests
chitests$statistic
chitests$parameter
chitests$p.value
We have the \(\chi^2\) value of the test from chitests$statistic = 112.919, and the p-value from chitests$p.value = \(2.248 \times 10^{-26}\), which is less than the significance level of 0.05. Hence we reject the null hypothesis H0 and conclude that the churn rate depends on the gender of the customer.
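For reference, the statistic that chisq.test computes is the standard

\[
\chi^2 = \sum_{i}\frac{(O_i - E_i)^2}{E_i},
\]

where the \(O_i\) are the observed cell counts of the contingency table and the \(E_i\) are the expected counts under independence (row total times column total divided by the grand total).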
ggplot(data=churn_data,aes(x=gender,fill=churn))+
geom_bar(col="black")+
scale_fill_manual('Position', values=c('red', 'lightblue'))+
labs(title = "Gender vs Churn",x="Gender",y="Customer Count")+
theme_minimal()
We observed that the majority of churned customers were female, even though the bank's customer base is predominantly male.
Chi Square Test for independence
H0: Having a credit card and churn are independent
H1: Having a credit card and churn are not independent
# churn and credit_card were already recoded to descriptive labels above
cc_churn <- table(churn_data$churn, churn_data$credit_card)
cc_churn
chitestccchurn <- chisq.test(cc_churn)
chitestccchurn
ggplot(data=churn_data,aes(x=credit_card,fill=churn))+
geom_bar()+
scale_fill_manual('Position', values=c('red', 'lightblue')) +
labs(title = "Customer Churn in Customers With and Without Credit Cards",x="Credit Card",y="Churn")+
theme_minimal()
Credit Card vs Churn: the p-value of 0.5 is much higher than the significance level (0.05 at df = 1). Thus, we fail to reject the null hypothesis H0; results this extreme would occur by chance about 50% of the time, so we find no evidence that having a credit card is associated with churn.
Chi Square Test for independence
H0: Country and churn are independent
H1: Country and churn are not independent
#Contingency table
contable = table(churn_data$country, churn_data$churn)
xkabledply(contable, title="Contingency table for Country (Customer belongs to) vs Churn (0 or 1)")
#Chisquare Test of independency
chitest <- chisq.test(contable)
chitest
chitest$statistic
chitest$parameter
chitest$p.value
We extract the \(\chi^2\) value of the test from chitest$statistic and the p-value from chitest$p.value; the p-value is far below the significance level of 0.05. As a result, we reject the null hypothesis H0 and conclude that the churn rate depends on the customer's country.
ggplot(data=churn_data,aes(x=country,fill=churn))+
geom_bar()+
scale_fill_manual('Position', values=c('red', 'lavenderblush2')) +
labs(title = "Customer Churn in Different Countries",x="Country",y="Churn")+
theme_minimal()
We can see that more customers from Germany and France churned than from Spain.
Chi Square Test for independence
H0: Banking products, services and churn are independent
H1: Banking products, services and churn are not independent
#Contingency table
contab_productnumber = table(churn_data$churn, churn_data$products_number)
xkabledply(contab_productnumber, title="Contingency table for Banking Services vs Churn (0 or 1)")
#Chisquare Test of independency
chitests = chisq.test(contab_productnumber)
chitests
chitests$statistic
chitests$parameter
We have the \(\chi^2\) value of the test from chitests$statistic = 1503.629, while the p-value from chitests$p.value is reported as 0 (below machine precision), far less than the significance level of 0.05. As a result, we reject the null hypothesis H0 and conclude that the churn rate depends on the number of banking products the customer holds.
ggplot(data=churn_data,aes(x=products_number,fill=churn))+
geom_bar()+
scale_fill_manual('Position', values=c('red', 'lavender')) +
labs(title = "Banking Product and Services vs Churn",x="Banking Product and Services",y="Customer Count")+
theme_minimal()
We can see that customers holding one, three, or four products churned at higher rates than customers holding two products.
Chi Square Test for independence
H0: Customer’s active status and churn are independent
H1: Customer’s active status and churn are not independent
#Contingency table
contab_activemember = table(churn_data$churn, churn_data$active_member)
xkabledply(contab_activemember, title="Contingency table for Account Status(Active, In Active) vs Churn (0 or 1)")
#Chisquare Test of independency
chitests = chisq.test(contab_activemember)
chitests
chitests$statistic
chitests$parameter
chitests$p.value
We have the \(\chi^2\) value of the test from chitests$statistic = 242.985, while the p-value from chitests$p.value = \(8.786 \times 10^{-55}\) is less than the significance level of 0.05. As a result, we reject the null hypothesis H0 and conclude that the churn rate depends on the customer's active-member status.
ggplot(data=churn_data,aes(x=active_member,fill=churn))+
geom_bar()+
scale_fill_manual('Position', values=c('red', 'pink1')) +
labs(title = "Active Members vs Churn",x="Active Status of Account Holders",y="Churn")+
theme_minimal()
Here we can see that inactive customers churned at a noticeably higher rate than active customers.
Correlation test between account balance and churn
library(corrplot)
library(dplyr) # for select_if
churn_data_to_numeric <- churn_data
# churn was recoded to "Churned"/"Retained" above; map it back to 1/0 for correlation
churn_data_to_numeric$churn <- ifelse(churn_data_to_numeric$churn == "Churned", 1, 0)
churn_data_numeric <- select_if(churn_data_to_numeric, is.numeric)
cors_bal <- cor(churn_data_numeric)
corrplot(cors_bal, method = 'number')
cor_balance <- cor.test(churn_data_numeric$balance, churn_data_numeric$churn, method = "pearson")
cor_balance
Customer churn is weakly correlated with account balance, with a coefficient of 0.12. For further analysis we can check the correlation score after combining multiple variables.
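For reference, the coefficient cor.test reports is the standard Pearson correlation

\[
r = \frac{\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i}(x_i-\bar{x})^2}\,\sqrt{\sum_{i}(y_i-\bar{y})^2}},
\]

which ranges from -1 to 1; a value of 0.12 therefore indicates only a weak positive linear association.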
loadPkg("BSDA")
ttestbalances95 = t.test(churn_balance_sample$credit_score) # default conf.level = 0.95
ttestbalances95
ttestbalances99 = t.test(churn_balance_sample$credit_score, conf.level=0.99 )
ttestbalances99
ttestbalances50 = t.test(churn_balance_sample$credit_score, conf.level=0.50 )
ttestbalances50
ttestbalances95$conf.int
ttestbalances99$conf.int
ttestbalances50$conf.int
Correlation test between credit score and churn
library(corrplot)
library(dplyr) # for select_if
# map the churn labels back to 1/0 before copying the data frame
churn_data$churn[churn_data$churn== "Churned"]<-1
churn_data$churn[churn_data$churn== "Retained"]<-0
churn_data_to_numeric <- churn_data
churn_data_to_numeric$churn <- as.numeric(churn_data_to_numeric$churn)
str(churn_data_to_numeric)
churn_data_numeric <- select_if(churn_data_to_numeric, is.numeric)
str(churn_data_numeric)
cors_credit <- cor(churn_data_numeric)
corrplot(cors_credit, method = 'number')
cor_scores <- cor(churn_data_numeric$credit_score, churn_data_numeric$churn)
cor_scores
The credit score is weakly correlated with customer churn, with a coefficient of -0.0271.
library(corrplot)
churn_data_to_numeric <- churn_data
churn_data$churn[churn_data$churn== "Churned"]<-1
churn_data$churn[churn_data$churn== "Retained"]<-0
churn_data_to_numeric$churn <- as.numeric(churn_data_to_numeric$churn)
str(churn_data_to_numeric)
churn_data_numeric <- select_if(churn_data_to_numeric, is.numeric)
str(churn_data_numeric)
cors_age_group <- cor(churn_data_numeric)
corrplot(cors_age_group, method = 'number')
cor_age <- cor(churn_data_numeric$age, churn_data_numeric$churn)
cor_age
loadPkg("BSDA")
ttestage95 = t.test(churn_balance_sample$age) # default conf.level = 0.95
ttestage95
ttestage99 = t.test(churn_balance_sample$age, conf.level=0.99 )
ttestage99
ttestage50 = t.test(churn_balance_sample$age, conf.level=0.50 )
ttestage50
ttestage95$conf.int
ttestage99$conf.int
ttestage50$conf.int
ggplot(churn_data, aes(x=age, fill=churn)) +
geom_histogram( color='#e9ecef', alpha=0.5, position='identity') + scale_fill_manual('Position', values=c('red', 'pink1')) +
labs(title = "Age Distribution across Churned and No Churn Account Holders",x="Age of Account Holders",y="Frequency")
#Contingency table
churn_age = table(churn_data$churn, churn_data$age)
churn_age
#Chisquare Test of independence
chichurn_age = chisq.test(churn_age)
chichurn_age
chichurn_age$statistic
chichurn_age$parameter
chichurn_age$p.value
We observed that age has the strongest correlation with churn among the numeric variables, so we combined different columns to see whether the correlation score improves.
corre<-cor(churn_data_numeric)
corre
corrplot(corre, method = "number")
Churn and age are weakly to moderately positively correlated (0.29).
corr <- cor(churn_data_numeric$age, churn_data_numeric$churn)
corr
## [1] 0.285
corr1 <- cor(churn_data_numeric$age + churn_data_numeric$tenure, churn_data_numeric$churn)
corr1
## [1] 0.272
corr2 <- cor(churn_data_numeric$age + churn_data_numeric$products_number, churn_data_numeric$churn)
corr2
## [1] 0.283
corr3 <- cor(churn_data_numeric$age + churn_data_numeric$balance, churn_data_numeric$churn)
corr3
## [1] 0.119
corr4 <- cor(churn_data_numeric$age + churn_data_numeric$tenure + churn_data_numeric$products_number, churn_data_numeric$churn)
corr4
## [1] 0.269
corr5 <- cor(churn_data_numeric$age + churn_data_numeric$products_number + churn_data_numeric$balance, churn_data_numeric$churn)
corr5
## [1] 0.119
The correlation score is highest for the age variable alone (0.285); combining other columns with age only lowers it.
Since the data set is imbalanced, we use an oversampling technique from the ROSE library to balance it.
The imbalance is largely driven by the country variable, where most of the records belong to France, so we need to perform data balancing.
table(churn_data$churn)
table(churn_data$country)
library(ROSE)
sum(churn_data$country == 'France')
## [1] 5014
sum(churn_data$country == 'Germany')
## [1] 2509
sum(churn_data$country == 'Spain')
## [1] 2477
ind <- sample(2, nrow(churn_data), replace = TRUE, prob = c(0.7, 0.3))
train <- churn_data[ind==1,]
test <- churn_data[ind==2,]
data_balanced_over <- ovun.sample(churn ~ ., data = train, method = "over",N = 10000,seed=123)$data
table(data_balanced_over$churn)
table(data_balanced_over$country)
ggplot(data=data_balanced_over,aes(x=country,fill=country))+
geom_bar(col="black")+
scale_fill_brewer(palette="Reds") +
labs(title = "Bank Customer vs Country",x="Country",y="Customer Count")+
theme_minimal()
Let us look at the proportion of customers who churned versus those who did not.
prop.table( table(churn_data$churn) )
The proportion table shows that around 20 percent of the customers churned. If those are the customers with the strongest relationships with the bank, the impact on its business will be severe. Let us use different modelling methods on the dataset to see whether we can find out what drives customers to close their accounts with the bank.
str(churn_data)
churn_data_pc<-data_balanced_over
# recode using churn_data_pc itself; data_balanced_over's rows differ from churn_data's
churn_data_pc$country[churn_data_pc$country == 'France'] <- 0
churn_data_pc$country[churn_data_pc$country == 'Germany'] <- 1
churn_data_pc$country[churn_data_pc$country == 'Spain'] <- 2
churn_data_pc$gender[churn_data_pc$gender == 'Male'] <- 0
churn_data_pc$gender[churn_data_pc$gender == 'Female'] <- 1
churn_data_pc$credit_card[churn_data_pc$credit_card == 'No-Credit Card'] <- 0
churn_data_pc$credit_card[churn_data_pc$credit_card == 'Credit Card'] <- 1
churn_data_pc$active_member[churn_data_pc$active_member == 'In Active'] <- 0
churn_data_pc$active_member[churn_data_pc$active_member == 'Active'] <- 1
churn_data_pc$active_member = as.numeric(churn_data_pc$active_member)
churn_data_pc$credit_card = as.numeric(churn_data_pc$credit_card)
churn_data_pc$country = as.numeric(churn_data_pc$country)
churn_data_pc$credit_score = as.numeric(churn_data_pc$credit_score)
churn_data_pc$gender = as.numeric(churn_data_pc$gender)
churn_data_pc$churn = as.numeric(churn_data_pc$churn)
churn_data_pc_target = churn_data_pc #includes 'churn'
churn_data_logit <- churn_data_pc_target
str(churn_data_logit)
churn_data_pc <- churn_data_logit[-c(11)] # drop the churn column
str(churn_data_pc)
str(churn_data_pc_target)
str(churn_data_pc)
churn_data_scale <- data.frame(scale(churn_data_pc))
xkabledply(cor(churn_data_scale))
xkabledply(cov(churn_data_scale))
churnLM <- lm(churn~country+age+gender,data=churn_data_pc_target)
summary(churnLM)
##
## Call:
## lm(formula = churn ~ country + age + gender, data = churn_data_pc_target)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.268 -0.373 -0.214 0.463 0.927
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.218369 0.019199 -11.37 <2e-16 ***
## country -0.005953 0.005610 -1.06 0.29
## age 0.016364 0.000436 37.58 <2e-16 ***
## gender -0.013606 0.009324 -1.46 0.14
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.464 on 9996 degrees of freedom
## Multiple R-squared: 0.124, Adjusted R-squared: 0.124
## F-statistic: 471 on 3 and 9996 DF, p-value: <2e-16
xkabledply(churnLM, title = "churn Model summary")
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | -0.2184 | 0.0192 | -11.37 | 0.000 |
| country | -0.0060 | 0.0056 | -1.06 | 0.289 |
| age | 0.0164 | 0.0004 | 37.58 | 0.000 |
| gender | -0.0136 | 0.0093 | -1.46 | 0.144 |
confint(churnLM)
## 2.5 % 97.5 %
## (Intercept) -0.2560 -0.18074
## country -0.0170 0.00504
## age 0.0155 0.01722
## gender -0.0319 0.00467
churnLM2 <- lm(churn~products_number+estimated_salary+balance+active_member,data=churn_data_pc_target)
summary(churnLM2)
##
## Call:
## lm(formula = churn ~ products_number + estimated_salary + balance +
## active_member, data = churn_data_pc_target)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.633 -0.460 -0.325 0.521 0.704
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.17e-01 1.81e-02 17.51 <2e-16 ***
## products_number -6.23e-03 7.58e-03 -0.82 0.411
## estimated_salary 1.86e-07 8.54e-08 2.18 0.029 *
## balance 1.24e-06 8.18e-08 15.13 <2e-16 ***
## active_member 1.54e-02 9.80e-03 1.57 0.117
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.49 on 9995 degrees of freedom
## Multiple R-squared: 0.0247, Adjusted R-squared: 0.0243
## F-statistic: 63.2 on 4 and 9995 DF, p-value: <2e-16
xkabledply(churnLM2, title = "churn Model summary")
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 0.3175 | 0.0181 | 17.515 | 0.0000 |
| products_number | -0.0062 | 0.0076 | -0.823 | 0.4107 |
| estimated_salary | 0.0000 | 0.0000 | 2.183 | 0.0291 |
| balance | 0.0000 | 0.0000 | 15.131 | 0.0000 |
| active_member | 0.0154 | 0.0098 | 1.566 | 0.1173 |
confint(churnLM2)
## 2.5 % 97.5 %
## (Intercept) 2.82e-01 3.53e-01
## products_number -2.11e-02 8.62e-03
## estimated_salary 1.90e-08 3.54e-07
## balance 1.08e-06 1.40e-06
## active_member -3.86e-03 3.46e-02
pr.out=prcomp(churn_data_pc, scale =TRUE)
print("Case: z-score/scaled")
summary(pr.out)
pr.out$rotation
nrow(train)
prop.table( table(train$churn) )
biplot(pr.out, scale = 0)
biplot(pr.out,2:3, scale =0)
biplot(pr.out,3:4, scale =0)
pr.var <- (pr.out$sdev^2)
pve <- pr.var/sum(pr.var)
plot(cumsum(pve), xlab="Principal Component (standardized)", ylab ="Cumulative Proportion of Variance Explained",ylim=c(0,1),type="b")
Feature selection using Exhaustive search
loadPkg("leaps")
reg.churn <- regsubsets(churn~., data = churn_data_pc_target, nvmax =9, nbest = 1, method = "exhaustive")
plot(reg.churn , scale = "adjr2", main = "Adjusted R^2")
plot(reg.churn , scale = "r2", main = "R^2")
plot(reg.churn , scale = "bic", main = "BIC")
plot(reg.churn , scale = "Cp", main = "Cp")
summary(reg.churn )
Exhaustive search selected age and balance, a two-variable model with an adjusted R^2 of 0.14.
The best model selected using BIC is credit_score, age, and balance, a 3-variable model with a BIC of about -1500.
The best model selected using Cp is credit_score, age, tenure, balance, and estimated_salary, a 5-variable model with a Cp of about 7.
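For reference, the selection criteria regsubsets reports are (standard definitions, stated here up to additive constants):

\[
C_p = \frac{SSE_p}{\hat{\sigma}^2_{\text{full}}} - n + 2p, \qquad BIC = n\ln\!\left(\frac{SSE_p}{n}\right) + p\ln n,
\]

where \(SSE_p\) is the residual sum of squares of a model with \(p\) parameters and \(\hat{\sigma}^2_{\text{full}}\) is estimated from the full model; lower values are better for both criteria.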
loadPkg("car")
# Adjusted R2
car::subsets(reg.churn, statistic="adjr2", legend = FALSE, min.size = 2, main = "Adjusted R^2")
subsets(reg.churn, statistic="cp", legend = FALSE, min.size = 4, main = "Mallow Cp")
abline(a = 1, b = 1, lty = 3)
The Mallows Cp plot selected two best models:
1) credit_score, country, gender, age, tenure, balance, active_member, estimated_salary (an 8-variable model); 2) credit_score, country, gender, age, tenure, balance, estimated_salary (a 7-variable model).
Feature selection using forward search
reg.churnforward <- regsubsets(churn~., data = churn_data_pc_target, nvmax =9, nbest = 1, method = "forward")
plot(reg.churnforward , scale = "adjr2", main = "Adjusted R^2")
plot(reg.churnforward , scale = "bic", main = "BIC")
plot(reg.churnforward , scale = "Cp", main = "Cp")
summary(reg.churnforward)
The best model selected using forward search is age and balance, a 2-variable model with an adjusted R^2 of 0.14.
The best model selected using BIC is credit_card, age, and balance, a 3-variable model with a BIC of about -1500.
The best model selected using Cp is credit_score, age, tenure, balance, and estimated_salary, a 5-variable model with a Cp of about 7.
Feature selection using backward search
reg.churnback <- regsubsets(churn~., data = churn_data_pc_target, method = "backward", nvmax = 9, nbest = 1)
plot(reg.churnback, scale = "adjr2", main = "Adjusted R^2")
plot(reg.churnback, scale = "bic", main = "BIC")
plot(reg.churnback, scale = "Cp", main = "Cp")
summary(reg.churnback)
The best model selected by backward search using adjusted R^2 is age, balance: a two-variable model with an adjusted R^2 of 0.14.
The best model selected using BIC is credit_card, age, balance: a three-variable model with a BIC of about -1500.
The best model selected using Cp is credit_score, age, tenure, balance, estimated_salary: a five-variable model with a Cp of about 7.
Feature selection using Sequential Replacement
reg.churnseqrep <- regsubsets(churn~., data = churn_data_pc_target, nvmax = 9, nbest = 1, method = "seqrep")
plot(reg.churnseqrep, scale = "adjr2", main = "Adjusted R^2")
plot(reg.churnseqrep, scale = "bic", main = "BIC")
plot(reg.churnseqrep, scale = "Cp", main = "Cp")
The best model selected by sequential replacement using adjusted R^2 is age, balance: a two-variable model with an adjusted R^2 of 0.14.
The best model selected using BIC is credit_card, age, balance, estimated_salary: a four-variable model with a BIC of about -1500.
The best model selected using Cp is credit_score, age, tenure, balance, estimated_salary: a five-variable model with a Cp of about 7.
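Because forward, backward, and sequential replacement are greedy searches, they are not guaranteed to agree with the exhaustive result. A small sketch (simulated data, not the churn data; assumes the leaps package) running all four methods in one loop and comparing their BIC-best subset sizes:

```r
library(leaps)

# Simulated data: y depends on x1 and x3 only.
set.seed(1)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
sim$y <- sim$x1 + 0.5 * sim$x3 + rnorm(n)

# Run each search strategy and report the subset size BIC prefers.
for (m in c("exhaustive", "forward", "backward", "seqrep")) {
  s <- summary(regsubsets(y ~ ., data = sim, nvmax = 4, nbest = 1, method = m))
  cat(m, ": BIC-best size =", which.min(s$bic), "\n")
}
```

When the methods disagree, as they do here between the BIC picks, the exhaustive result is the reference, since it is the only one that examines every subset.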
churn_logit <- glm(churn ~., data = churn_data_logit, family = 'binomial')
summary(churn_logit)
##
## Call:
## glm(formula = churn ~ ., family = "binomial", data = churn_data_logit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.758 -0.944 -0.640 1.079 2.183
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.12e+00 1.97e-01 -15.87 < 2e-16 ***
## credit_score -8.49e-04 2.24e-04 -3.79 0.00015 ***
## country -3.36e-02 2.64e-02 -1.27 0.20351
## gender -6.32e-02 4.39e-02 -1.44 0.14990
## age 7.47e-02 2.29e-03 32.57 < 2e-16 ***
## tenure -1.56e-02 7.47e-03 -2.09 0.03657 *
## balance 5.19e-06 3.70e-07 14.04 < 2e-16 ***
## products_number -2.21e-02 3.32e-02 -0.67 0.50570
## credit_card -4.12e-02 4.80e-02 -0.86 0.39019
## active_member 4.76e-02 4.37e-02 1.09 0.27600
## estimated_salary 1.01e-06 3.81e-07 2.65 0.00817 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13698 on 9999 degrees of freedom
## Residual deviance: 12152 on 9989 degrees of freedom
## AIC: 12174
##
## Number of Fisher Scoring iterations: 4
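The logistic coefficients above are on the log-odds scale; exponentiating them gives odds ratios, which are easier to interpret. A quick sketch using the estimates reported in the summary (equivalently, `exp(coef(churn_logit))` on the fitted object):

```r
# Estimates copied from the summary output above.
coefs <- c(age = 7.47e-02, balance = 5.19e-06, credit_score = -8.49e-04)
round(exp(coefs), 4)
# e.g. exp(0.0747) ≈ 1.078: each additional year of age multiplies the
# odds of churn by about 1.08, holding the other predictors fixed.
```

Odds ratios above 1 (age, balance) raise the odds of churn; those below 1 (credit_score) lower them, though the per-unit effects for balance and credit_score are tiny because those variables span large ranges.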
xkabledply( confint(churn_logit), title = "CIs using profiled log-likelihood" )
| | 2.5 % | 97.5 % |
|---|---|---|
| (Intercept) | -3.5122 | -2.7400 |
| credit_score | -0.0013 | -0.0004 |
| country | -0.0854 | 0.0181 |
| gender | -0.1493 | 0.0228 |
| age | 0.0703 | 0.0793 |
| tenure | -0.0303 | -0.0010 |
| balance | 0.0000 | 0.0000 |
| products_number | -0.0872 | 0.0430 |
| credit_card | -0.1351 | 0.0528 |
| active_member | -0.0380 | 0.1333 |
| estimated_salary | 0.0000 | 0.0000 |
xkabledply( confint.default(churn_logit), title = "CIs using standard errors" )
| | 2.5 % | 97.5 % |
|---|---|---|
| (Intercept) | -3.5109 | -2.7388 |
| credit_score | -0.0013 | -0.0004 |
| country | -0.0853 | 0.0182 |
| gender | -0.1493 | 0.0228 |
| age | 0.0702 | 0.0792 |
| tenure | -0.0303 | -0.0010 |
| balance | 0.0000 | 0.0000 |
| products_number | -0.0872 | 0.0430 |
| credit_card | -0.1352 | 0.0528 |
| active_member | -0.0381 | 0.1333 |
| estimated_salary | 0.0000 | 0.0000 |
loadPkg("regclass")
xkabledply( confusion_matrix(churn_logit), title = "Confusion matrix from Logit Model" )
| | Predicted 0 | Predicted 1 | Total |
|---|---|---|---|
| Actual 0 | 4589 | 1052 | 5641 |
| Actual 1 | 1959 | 2400 | 4359 |
| Total | 6548 | 3452 | 10000 |
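The confusion matrix above can be condensed into the usual classification metrics. A short sketch using the counts from the table (predictions at the default probability cutoff):

```r
# Counts copied from the confusion matrix above (churn = 1 is the positive class).
TN <- 4589; FP <- 1052; FN <- 1959; TP <- 2400

accuracy  <- (TP + TN) / (TP + TN + FP + FN)  # 0.6989
precision <- TP / (TP + FP)                   # ≈ 0.695
recall    <- TP / (TP + FN)                   # ≈ 0.551 (sensitivity)
round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
```

The recall of about 0.55 means the model misses almost half of the customers who actually churn, which matters more for a retention campaign than the raw accuracy of about 0.70.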
loadPkg("pROC")
prob=predict(churn_logit, type = "response" )
churn_data_logit$prob=prob
h <- roc(churn~prob, data = churn_data_logit)
auc(h)
## Area under the curve: 0.745
plot(h)
Based on the feature subsets obtained from the selection procedures above, we now fit a regression model for each subset and compare their churn predictions.
churnglm1 <- glm(churn ~ age + balance, data = churn_data_logit)
summary(churnglm1)
##
## Call:
## glm(formula = churn ~ age + balance, data = churn_data_logit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.302 -0.371 -0.190 0.453 0.952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.05e-01 1.88e-02 -16.2 <2e-16 ***
## age 1.60e-02 4.32e-04 37.1 <2e-16 ***
## balance 1.10e-06 7.51e-08 14.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.211)
##
## Null deviance: 2458.9 on 9999 degrees of freedom
## Residual deviance: 2109.6 on 9997 degrees of freedom
## AIC: 12826
##
## Number of Fisher Scoring iterations: 2
xkabledply( confusion_matrix(churnglm1), title = "Confusion matrix from Logit Model-1" )
| | Predicted 0 | Predicted 1 | Total |
|---|---|---|---|
| Actual 0 | 4633 | 1008 | 5641 |
| Actual 1 | 2026 | 2333 | 4359 |
| Total | 6659 | 3341 | 10000 |
prob1=predict(churnglm1, type = "response" )
churn_data_logit$prob=prob1
h1 <- roc(churn~prob, data = churn_data_logit)
auc(h1)
## Area under the curve: 0.744
plot(h1)
churnglm2 <- glm(churn ~ credit_card+age + balance, data = churn_data_logit)
summary(churnglm2)
##
## Call:
## glm(formula = churn ~ credit_card + age + balance, data = churn_data_logit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.299 -0.372 -0.189 0.453 0.955
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.98e-01 2.01e-02 -14.80 <2e-16 ***
## credit_card -9.25e-03 1.01e-02 -0.92 0.36
## age 1.60e-02 4.32e-04 37.09 <2e-16 ***
## balance 1.10e-06 7.52e-08 14.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.211)
##
## Null deviance: 2458.9 on 9999 degrees of freedom
## Residual deviance: 2109.4 on 9996 degrees of freedom
## AIC: 12827
##
## Number of Fisher Scoring iterations: 2
xkabledply( confusion_matrix(churnglm2), title = "Confusion matrix from Logit Model-2" )
| | Predicted 0 | Predicted 1 | Total |
|---|---|---|---|
| Actual 0 | 4622 | 1019 | 5641 |
| Actual 1 | 1996 | 2363 | 4359 |
| Total | 6618 | 3382 | 10000 |
prob2=predict(churnglm2, type = "response" )
churn_data_logit$prob=prob2
h2 <- roc(churn~prob, data = churn_data_logit)
auc(h2)
## Area under the curve: 0.744
plot(h2)
churnglm3 <- glm(churn ~ credit_score+age+tenure+ balance + estimated_salary, data = churn_data_logit)
summary(churnglm3)
##
## Call:
## glm(formula = churn ~ credit_score + age + tenure + balance +
## estimated_salary, data = churn_data_logit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.310 -0.372 -0.182 0.455 0.984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.91e-01 3.78e-02 -5.05 4.5e-07 ***
## credit_score -1.83e-04 4.70e-05 -3.89 9.9e-05 ***
## age 1.60e-02 4.31e-04 37.14 < 2e-16 ***
## tenure -3.29e-03 1.57e-03 -2.10 0.0359 *
## balance 1.10e-06 7.51e-08 14.70 < 2e-16 ***
## estimated_salary 2.10e-07 8.00e-08 2.63 0.0086 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.211)
##
## Null deviance: 2458.9 on 9999 degrees of freedom
## Residual deviance: 2103.9 on 9994 degrees of freedom
## AIC: 12805
##
## Number of Fisher Scoring iterations: 2
xkabledply( confusion_matrix(churnglm3), title = "Confusion matrix from Logit Model-3" )
| | Predicted 0 | Predicted 1 | Total |
|---|---|---|---|
| Actual 0 | 4624 | 1017 | 5641 |
| Actual 1 | 2022 | 2337 | 4359 |
| Total | 6646 | 3354 | 10000 |
prob3=predict(churnglm3, type = "response" )
churn_data_logit$prob=prob3
h3 <- roc(churn~prob, data = churn_data_logit)
auc(h3)
## Area under the curve: 0.745
plot(h3)
churnglm4 <- glm(churn ~ credit_score+country+gender+age+tenure+balance+estimated_salary, data = churn_data_logit)
summary(churnglm4)
##
## Call:
## glm(formula = churn ~ credit_score + country + gender + age +
## tenure + balance + estimated_salary, data = churn_data_logit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.301 -0.372 -0.181 0.453 0.992
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.81e-01 3.82e-02 -4.75 2.1e-06 ***
## credit_score -1.81e-04 4.70e-05 -3.86 0.00011 ***
## country -6.95e-03 5.55e-03 -1.25 0.21023
## gender -1.30e-02 9.22e-03 -1.41 0.15830
## age 1.60e-02 4.31e-04 37.16 < 2e-16 ***
## tenure -3.28e-03 1.57e-03 -2.09 0.03664 *
## balance 1.11e-06 7.51e-08 14.72 < 2e-16 ***
## estimated_salary 2.09e-07 8.00e-08 2.61 0.00900 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.21)
##
## Null deviance: 2458.9 on 9999 degrees of freedom
## Residual deviance: 2103.1 on 9992 degrees of freedom
## AIC: 12805
##
## Number of Fisher Scoring iterations: 2
xkabledply( confusion_matrix(churnglm4), title = "Confusion matrix from Logit Model-4" )
| | Predicted 0 | Predicted 1 | Total |
|---|---|---|---|
| Actual 0 | 4619 | 1022 | 5641 |
| Actual 1 | 2002 | 2357 | 4359 |
| Total | 6621 | 3379 | 10000 |
prob4=predict(churnglm4, type = "response" )
churn_data_logit$prob=prob4
h4 <- roc(churn~prob, data = churn_data_logit)
auc(h4)
## Area under the curve: 0.745
plot(h4)
churnglm5 <- glm(churn ~ credit_score + country + gender + age + tenure + balance + active_member + estimated_salary, data = churn_data_logit)
summary(churnglm5)
##
## Call:
## glm(formula = churn ~ credit_score + country + gender + age +
## tenure + balance + active_member + estimated_salary, data = churn_data_logit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.306 -0.373 -0.181 0.454 0.992
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.86e-01 3.84e-02 -4.84 1.3e-06 ***
## credit_score -1.82e-04 4.70e-05 -3.87 0.00011 ***
## country -6.99e-03 5.55e-03 -1.26 0.20754
## gender -1.28e-02 9.22e-03 -1.38 0.16618
## age 1.60e-02 4.31e-04 37.14 < 2e-16 ***
## tenure -3.31e-03 1.57e-03 -2.11 0.03491 *
## balance 1.11e-06 7.51e-08 14.73 < 2e-16 ***
## active_member 1.04e-02 9.19e-03 1.13 0.25681
## estimated_salary 2.09e-07 8.00e-08 2.61 0.00902 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.21)
##
## Null deviance: 2458.9 on 9999 degrees of freedom
## Residual deviance: 2102.9 on 9991 degrees of freedom
## AIC: 12806
##
## Number of Fisher Scoring iterations: 2
xkabledply( confusion_matrix(churnglm5), title = "Confusion matrix from Logit Model-5" )
| | Predicted 0 | Predicted 1 | Total |
|---|---|---|---|
| Actual 0 | 4624 | 1017 | 5641 |
| Actual 1 | 1994 | 2365 | 4359 |
| Total | 6618 | 3382 | 10000 |
prob5=predict(churnglm5, type = "response" )
churn_data_logit$prob=prob5
h5 <- roc(churn~prob, data = churn_data_logit)
auc(h5)
## Area under the curve: 0.745
plot(h5)
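Pulling the figures reported above into one table makes the model comparison explicit. A small summary sketch, with the AUC values and confusion-matrix counts copied from the outputs above:

```r
# Model names here are shorthand for the fits above; n_vars counts predictors.
comparison <- data.frame(
  model    = c("full", "glm1", "glm2", "glm3", "glm4", "glm5"),
  n_vars   = c(10, 2, 3, 5, 7, 8),
  auc      = c(0.745, 0.744, 0.744, 0.745, 0.745, 0.745),
  accuracy = c(4589 + 2400, 4633 + 2333, 4622 + 2363,
               4624 + 2337, 4619 + 2357, 4624 + 2365) / 10000
)
# Rank by AUC, breaking ties in favor of fewer predictors.
comparison[order(-comparison$auc, comparison$n_vars), ]
```

Every model sits at an AUC of roughly 0.744 to 0.745 and an accuracy near 0.70, so the two-variable age + balance model delivers essentially the same discrimination as the full ten-variable fit at a fraction of the complexity.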